Son Nguyen
Step 1: Standardize the variables
Step 2: Compute the test point's distance from each training point
Step 3: Sort the distances in ascending (or descending) order
Step 4: Use the sorted distances to select the K nearest neighbors
Step 5: Use majority rule (for classification) or averaging (for regression)
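The five steps above can be sketched in a few lines of Python (a minimal illustrative sketch; the function and variable names are my own, and the training data is assumed to be already standardized per Step 1):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors."""
    # Step 2: distance from the query point to every training point
    dists = [
        (math.dist(query, x), label)   # math.dist = Euclidean distance
        for x, label in zip(train_X, train_y)
    ]
    # Steps 3-4: sort ascending and keep the k closest
    dists.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in dists[:k]]
    # Step 5: majority rule
    return Counter(nearest).most_common(1)[0][0]

# Toy usage: two nearby 'A' points outvote the distant 'B'
print(knn_predict([[0, 0], [1, 1], [5, 5]], ['A', 'A', 'B'], [0.5, 0.5], k=3))
```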
The algorithm requires a distance measure; the most common choice is the Euclidean distance:
\[ d(u, v) = \sqrt{(u_1-v_1)^2 + (u_2-v_2)^2+\dots+(u_n-v_n)^2}, \] where \( u = [u_1, u_2,\dots,u_n] \) and \( v = [v_1, v_2,\dots,v_n] \).
| | Sex | Age | Class |
|---|---|---|---|
| u | Male | 27 | A |
| v | Female | 30 | A |
| Difference | 1 | 3 | 0 |

(For a categorical variable, the difference is coded 1 when the values differ and 0 when they match.)
\[ d(u, v) = \sqrt{1^2+3^2+0^2} = \sqrt{10} \]
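As a quick check of this mixed-variable example, coding Sex as 0/1 and the matching Class values as a difference of 0:

```python
import math

# Encode the table above numerically: Sex (Male=1, Female=0), Age, Class (A=0)
u = (1, 27, 0)   # Male, 27, Class A
v = (0, 30, 0)   # Female, 30, Class A

d = math.dist(u, v)
print(round(d, 4))  # sqrt(1 + 9 + 0) = sqrt(10) ≈ 3.1623
```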
Consider the following data:
| | Age | Salary ($1000) |
|---|---|---|
| A | 15 | 90 |
| B | 30 | 80 |
| C | 80 | 87 |
We have \[ AB = d(A, B) = \sqrt{(30-15)^2+(80-90)^2} = 18.03 \] \[ AC = d(A, C) = \sqrt{(80-15)^2+(87-90)^2} = 65.07 \]
Thus, \[ AB < AC \]
However, with the same data expressed in dollars:
| | Age | Salary ($) |
|---|---|---|
| A | 15 | 90,000 |
| B | 30 | 80,000 |
| C | 80 | 87,000 |
We have \[ AB = d(A, B) = \sqrt{(30-15)^2+(80,000-90,000)^2} = 10,000.01 \] \[ AC = d(A, C) = \sqrt{(80-15)^2+(87,000-90,000)^2} = 3000.70 \]
Thus, \[ AB > AC \]
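A quick numeric check of both unit choices, using the values from the two tables above, shows the ordering flip:

```python
import math

# Same three people, salary expressed in two different units
A_k, B_k, C_k = (15, 90), (30, 80), (80, 87)               # salary in $1000
A_d, B_d, C_d = (15, 90_000), (30, 80_000), (80, 87_000)   # salary in $

AB_k, AC_k = math.dist(A_k, B_k), math.dist(A_k, C_k)
AB_d, AC_d = math.dist(A_d, B_d), math.dist(A_d, C_d)

print(AB_k < AC_k)  # True:  18.03 < 65.07
print(AB_d < AC_d)  # False: 10,000.01 > 3000.70
```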
Not good!
Distances are affected by scalar multiplication, so the units of the variables affect the distances.
Standardizing variables will cancel this effect.
\[ \text{Standardized } X = \frac{X-X_{\min}}{X_{\max}-X_{\min}} \]
Standardize the previous data (in either unit for salary):
| | Age | Salary ($1000) |
|---|---|---|
| A | 0 | 1 |
| B | 0.23 | 0 |
| C | 1 | 0.7 |

(For salary, the minimum is 80 and the maximum is 90, so A standardizes to 1, B to 0, and C to 0.7.)

We have \[ AB = d(A, B) = \sqrt{(0-0.23)^2+(1-0)^2} = 1.03 \] \[ AC = d(A, C) = \sqrt{(0-1)^2+(1-0.7)^2} = 1.04 \]
Thus, with standardized variables we always have \[ AB < AC, \] whichever unit the salary was recorded in.
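Min-max standardization and the resulting distances can be checked with a short helper (the function name `min_max_standardize` is illustrative):

```python
import math

def min_max_standardize(rows):
    """Min-max scale each column of `rows` to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((x - l) / (h - l) for x, l, h in zip(row, lo, hi))
        for row in rows
    ]

data = [(15, 90), (30, 80), (80, 87)]   # (Age, Salary in $1000) for A, B, C
std = min_max_standardize(data)

AB = math.dist(std[0], std[1])
AC = math.dist(std[0], std[2])
print(AB < AC)  # True: the ordering no longer depends on the salary unit
```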
Small K: the model follows the training data closely, giving a flexible decision boundary that risks overfitting to noise.
Large K: predictions average over many neighbors, giving a smoother decision boundary that risks underfitting.
In general, with neighbor weights \( w_A \), \( w_B \), \( w_C \): \[ \text{Predicted Probability of } D = \frac{w_A \cdot y_A + w_B \cdot y_B + w_C \cdot y_C}{w_A + w_B + w_C} \]
Uniform Weights: All points in each neighborhood are weighted equally when predicting.
If \( A \), \( B \), \( C \) are the three nearest neighbors of \( D \), then the predicted probability of \( D \) by 3NN with uniform weights is:
\[ \text{Predicted Probability of } D = \frac{y_A + y_B + y_C}{3} \]
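A numeric sketch of the uniform-weight case (the neighbor labels here are hypothetical, with \( x \) coded as 1 and \( o \) as 0):

```python
# Hypothetical labels of D's three nearest neighbors: A and B are x, C is o
y_A, y_B, y_C = 1, 1, 0

# Uniform 3NN: a plain average of the neighbors' labels
prob_D = (y_A + y_B + y_C) / 3
pred_D = 'x' if prob_D >= 0.5 else 'o'
print(prob_D, pred_D)
```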
Distance Weights: Points are weighted by the inverse of their distance. In this case, closer neighbors of a query point have a greater influence than neighbors that are farther away.
If \( A \), \( B \), \( C \) are the three nearest neighbors of \( D \), then the predicted probability of \( D \) by 3NN with distance weights is:
\[ \text{Predicted Probability of } D = \frac{\frac{1}{DA} \cdot y_A + \frac{1}{DB} \cdot y_B + \frac{1}{DC} \cdot y_C}{\frac{1}{DA} + \frac{1}{DB} + \frac{1}{DC}} \]
Distance Weights: The closest neighbor (best friend!) is weighted more than the two farther-away neighbors, so the weighted vote predicts \( o \).
\[ \text{Weight} = \frac{1}{\text{Distance}} \]
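A numeric sketch of how distance weighting can flip the vote (the distances and labels below are hypothetical: the closest neighbor votes \( o = 0 \), the two farther ones vote \( x = 1 \)):

```python
# Hypothetical distances from D to its three nearest neighbors
DA, DB, DC = 1.0, 3.0, 4.0
# Closest neighbor A votes o (=0); the two farther neighbors vote x (=1)
y_A, y_B, y_C = 0, 1, 1

# Inverse-distance weights
w_A, w_B, w_C = 1 / DA, 1 / DB, 1 / DC
prob_D = (w_A * y_A + w_B * y_B + w_C * y_C) / (w_A + w_B + w_C)
pred_D = 'x' if prob_D >= 0.5 else 'o'
print(round(prob_D, 3), pred_D)
```

Uniform weights would give \( 2/3 \) and predict \( x \); the heavily weighted closest neighbor pulls the probability down to \( 7/19 \approx 0.37 \), so the prediction flips to \( o \).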
Use the uniform weights to calculate the predicted probability and the prediction of 3NN for D. Consider \( x \) as 1 and \( o \) as 0.
Use the distance weights to calculate the predicted probability and the prediction of 3NN for D. Consider \( x \) as 1 and \( o \) as 0.